This is David Utassy's assignment for the course Data Science 1. The assignment was developed in this GitHub repository.
After importing the data, the first task is to explore it. Regarding the outcome variable, the following histograms show that it is worth modeling logTotalValue instead of the raw TotalValue, as its distribution is much closer to normal.
During the exploratory data analysis I also found that the following predictor variables are strongly right-skewed, so I applied a log transformation to them:
(LotArea, BldgArea, ComArea, ResArea, OfficeArea, RetailArea, GarageArea, StrgeArea, FactryArea, OtherArea, NumBldgs, NumFloors, UnitsRes, UnitsTotal, LotFront, BldgDepth, BuiltFAR)
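The transformation itself can be sketched as below. Here `data` is assumed to hold the cleaned property table; using log1p (i.e. log(1 + x)) rather than a plain log is an assumption on my part, chosen so that zero-valued areas and counts do not produce -Inf.

```r
library(dplyr)

# Right-skewed predictors identified during the EDA (each listed once)
skewed_vars <- c(
  "LotArea", "BldgArea", "ComArea", "ResArea", "OfficeArea", "RetailArea",
  "GarageArea", "StrgeArea", "FactryArea", "OtherArea", "NumBldgs",
  "NumFloors", "UnitsRes", "UnitsTotal", "LotFront", "BldgDepth", "BuiltFAR"
)

# log1p handles the zeros that are common in area and count variables
data <- data %>%
  mutate(across(all_of(skewed_vars), log1p))
```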
Furthermore, I created a correlation matrix plot to get a feel for the correlations, and highlighted some pairs with the ggpairs() function. As the goal is prediction, I used all variables apart from TotalValue as predictors in order to achieve the best result.
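A minimal sketch of that exploration, assuming `data` holds the (numeric) property variables; ggcorr() from GGally draws the full correlation matrix, and ggpairs() zooms in on a few columns (the column selection below is purely illustrative):

```r
library(GGally)

# Heatmap-style overview of all pairwise correlations
ggcorr(data)

# Detailed pairwise plots for a handful of variables of interest
ggpairs(data, columns = c("logTotalValue", "LotArea", "BldgArea", "NumFloors"))
```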
In the following code snippet I create a training and a test set, assigning 70% of the observations to the training set. Note that the task asked for only 30% in the training set; in my experiments the models performed better with the 70% split, so I kept it, although it comes at the cost of a longer runtime. Additionally, I predefined trainControl for the caret train function so that all models use 10-fold cross-validation.
set.seed(1234)
training_ratio <- 0.7
train_indices <- createDataPartition(
  y = data[["logTotalValue"]],
  times = 1,
  p = training_ratio,
  list = FALSE
) %>% as.vector()
data_train <- data[train_indices, ]
data_test <- data[-train_indices, ]
fit_control <- trainControl(method = "cv", number = 10) #, selectionFunction = "oneSE")
In the following code snippet I use the caret package to train a simple linear regression model with 10-fold cross-validation. Its results are compared with the other models later.
set.seed(857)
linear_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "lm",
  preProcess = c("center", "scale"),
  trControl = fit_control
)
In the following code snippet I create a ridge model by setting the alpha parameter to zero in the caret train function. Using 10-fold cross-validation, caret finds the optimal regularization parameter (lambda), which is the weight of the penalty on the sum of squares of the regression coefficients.
# ridge model
ridge_tune_grid <- expand.grid(
  "alpha" = c(0),
  "lambda" = seq(0.025, 0.5, by = 0.025)
)
set.seed(857)
ridge_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = ridge_tune_grid,
  trControl = fit_control
)
In contrast to ridge, to create a LASSO model we have to set the alpha parameter to one. In the case of LASSO, the regularization parameter (lambda) is the weight of the penalty on the absolute values of the regression coefficients.
tenpowers <- 10^seq(-1, -5, by = -1)
lasso_tune_grid <- expand.grid(
  "alpha" = c(1),
  "lambda" = c(tenpowers, tenpowers / 2)
)
set.seed(857)
lasso_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = lasso_tune_grid,
  trControl = fit_control
)
We can also combine the two types of penalties. LASSO is attractive since it performs principled variable selection; however, when features are correlated, typically only one of them is kept in the model, quite arbitrarily. Ridge, instead, shrinks the coefficients of correlated features towards zero simultaneously. If we apply penalties on both the absolute values and the squares of the coefficients, both virtues are retained. This method is called elastic net.
To combine LASSO and ridge, caret also has to tune the alpha parameter. After optimization, the final values used for the model were alpha = 0.6 and lambda = 0.0001.
enet_tune_grid <- expand.grid(
  "alpha" = seq(0, 1, by = 0.1),
  "lambda" = union(lasso_tune_grid[["lambda"]], ridge_tune_grid[["lambda"]])
)
set.seed(857)
enet_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = enet_tune_grid,
  trControl = fit_control
)
To compare the models, the most direct way is to compare the cross-validated RMSE of each model, shown in the table below. The differences are small, but it turns out that the simple linear model (using the log-scaled predictors) is the best among them.
| Model | CV RMSE |
|---|---|
| Linear | 0.5067 |
| Ridge | 0.5178 |
| Lasso | 0.5069 |
| Elastic Net | 0.5068 |
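The comparison above can be produced with caret's resamples() helper, assuming the four fitted models from the previous snippets are in scope (seeding each fit with the same seed ensures comparable folds):

```r
# Collect the 10-fold CV results of all four models into one object
cv_results <- resamples(list(
  "Linear" = linear_fit,
  "Ridge" = ridge_fit,
  "Lasso" = lasso_fit,
  "Elastic Net" = enet_fit
))

# Mean cross-validated RMSE (and MAE, R-squared) per model
summary(cv_results, metric = "RMSE")
```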
In my case the best model was the simple linear one, although the differences were marginal. If the goal is to use the simplest possible model, LASSO or elastic net would also be reasonable choices, as they can reduce the number of included variables and therefore the complexity. However, neither of them reduced the number of variables significantly, since the optimal lambda was almost zero in both cases. I also tried the "oneSE" selection function, but it did not make a significant difference either.
In the following code snippet I use caret’s pcr method to run a linear regression with Principal Component Analysis applied to the predictor variables. Its hyperparameter to tune is the number of principal components to use; by defining a tuning grid I was able to find the optimal number as well.
tune_grid <- data.frame(ncomp = 80:240)
set.seed(857)
pcr_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "pcr",
  trControl = fit_control,
  tuneGrid = tune_grid,
  preProcess = c("center", "scale")
)
In the following code snippet I apply PCA prior to estimating the penalized model via preProcess, fixing the number of components at the optimum found with pcr above.
enet_tune_grid <- expand.grid(
  "alpha" = seq(0, 1, by = 0.1),
  "lambda" = union(lasso_tune_grid[["lambda"]], ridge_tune_grid[["lambda"]])
)
set.seed(857)
enet_pca_fit <- train(
  logTotalValue ~ . -TotalValue,
  data = data_train,
  method = "glmnet",
  preProcess = c("center", "scale", "pca", "nzv"),
  tuneGrid = enet_tune_grid,
  trControl = trainControl(
    method = "cv",
    number = 10,
    preProcOptions = list(pcaComp = 237)
  )
)
In the table below we can see that PCA did not help achieve a better fit in this case, although in the linear case the result is close. By using PCA we lose the interpretability of OLS, but in ideal cases we might be able to use fewer variables (which are linear combinations of the original ones).
| Model | CV RMSE |
|---|---|
| Linear | 0.5067 |
| Ridge | 0.5178 |
| Lasso | 0.5069 |
| Elastic Net | 0.5068 |
| Linear w PCA | 0.5069 |
| Elastic Net w PCA | 0.5395 |
According to the table above, the best model is still the simplest linear model. To evaluate it, I calculated the RMSE on the test set, which turned out to be 0.500. This is reassuring, as it is slightly smaller than and close to the cross-validated RMSE, suggesting that the model is stable.
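A sketch of that evaluation, using caret's RMSE() helper and the test set held out earlier:

```r
# Predict on the held-out 30% and compute the test-set RMSE
test_pred <- predict(linear_fit, newdata = data_test)
RMSE(test_pred, data_test[["logTotalValue"]])
```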
In this problem use the USArrests dataset we used in class. Your task is to apply clustering then make sense of the clusters using the principal components.
The dataset I am using is the USArrests dataset. On the official RDocumentation site the description is the following: “This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.”
Therefore my variables are:
- Murder: murder arrests (per 100,000)
- Assault: assault arrests (per 100,000)
- Rape: rape arrests (per 100,000)
- UrbanPop: the percent of the population living in urban areas
After looking at the distributions of the variables, I decided not to transform any of them: only Rape was slightly right-skewed, and the others were roughly normally distributed. This way I keep a more convenient interpretation later, as all my variables stay in a similar form. I would also like to highlight that all the variables are rates (per 100,000) or percentages, so I do not need to add any absolute population variable.
The following plot makes it easy to see the main properties of the variables.
First of all, I used the following code snippet to visualize the evolution of the within-sum-of-squares as the number of clusters grows. This makes it easy to identify a so-called “elbow point”, which can indicate the optimal number of clusters (in this case two or three).
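One way to produce such an elbow plot is with factoextra's fviz_nbclust(), which runs k-means for each candidate k and plots the total within-cluster sum of squares:

```r
library(factoextra)

# Scale the variables so no single one dominates the distance metric
data_scaled <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..10
fviz_nbclust(data_scaled, kmeans, method = "wss", k.max = 10)
```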
Next, I used the NbClust heuristics, which combine several indices to suggest an optimal number of clusters. In this case, 9 of the indices proposed 2 and 5 proposed 3 as the best number of clusters.
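A sketch of that heuristic on the scaled data; the printed summary reports how many indices vote for each candidate k:

```r
library(NbClust)

# Run all of NbClust's indices for k-means with 2..10 clusters
nb <- NbClust(scale(USArrests), method = "kmeans",
              min.nc = 2, max.nc = 10, index = "all")
```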
In the following code snippet, I am performing PCA on the original variables and I am getting the first three principal components as I would like to experiment with 3D plotting as well.
pca_result <- prcomp(data, scale = TRUE)
first_three_pc <- as_tibble(pca_result$x[, 1:3])
As the exercise asks for it, the following plot uses the first two principal components and shows the clusters on the plane defined by PC1 and PC2. We can observe that the clusters separate along PC1, similarly to the previous plot where they separated along the assault axis. On the other hand, each cluster still shows broad variance along the other principal component.
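A sketch of that plot, assuming two k-means clusters (as the heuristics above suggested); the PCA is recomputed here so the snippet is self-contained:

```r
library(ggplot2)

set.seed(1234)
km <- kmeans(scale(USArrests), centers = 2, nstart = 20)
pca_result <- prcomp(USArrests, scale = TRUE)

# First two principal components, colored by cluster membership
plot_df <- data.frame(pca_result$x[, 1:2], cluster = factor(km$cluster))
ggplot(plot_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point() +
  theme_minimal()
```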
Finally, I was curious about the 3D case using the first three PCs. The following interactive 3D plot shows that all three components carry considerable variance, meaning each contains a substantial amount of information.
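One way to build such an interactive plot is with plotly (a sketch, assuming `first_three_pc` from the earlier snippet and a k-means clustering `km` of the scaled data):

```r
library(plotly)

# Interactive 3D scatter of the first three PCs, colored by cluster
plot_ly(
  first_three_pc,
  x = ~PC1, y = ~PC2, z = ~PC3,
  color = ~factor(km$cluster),
  type = "scatter3d", mode = "markers"
)
```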
In this exercise, you will perform PCA on 40 observations of 1000 variables. This is very different from what you are used to: there are many more variables than observations! These are measurements of genes from tissues of healthy and diseased patients: the first 20 observations come from healthy patients and the rest from diseased ones.
In the following code snippet, I am performing PCA on the dataset with scaling features.
pca_result <- prcomp(genes, scale = TRUE)
Using the fviz_pca_ind function, I plotted the data points on the plane of the first two principal components. The figure makes it clear that the healthy and diseased observations are well separated along the first PC.
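A sketch of that call, assuming `pca_result` from the snippet above and that, per the exercise description, the first 20 rows are the healthy patients:

```r
library(factoextra)

# Label vector: first 20 observations healthy, the remaining 20 diseased
health <- factor(rep(c("healthy", "diseased"), each = 20))

# Observations on the PC1-PC2 plane, colored by health status
fviz_pca_ind(pca_result, habillage = health)
```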
From the previous plot we can see that PC1 matters a lot for separating the dataset into two groups. With the fviz_contrib function we can visualize the top 5 loadings of the first PC; these are the most important variables in PC1 (the ones with the largest weights in the linear combination of the original variables). We could also have read this off the rotation matrix; fviz_contrib gives the same result.
fviz_contrib(pca_result, "var", axes = 1, top=5)
We can see the two most important features (V502, V589) and can use them to plot the observations in the coordinate system defined by these two original features. The plot makes it clear that these two variables are truly important, as the data points are already well separated on this plane. It is worth highlighting, though, that the separation was even more clear-cut on the PC1-PC2 plane.
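That plot can be sketched as below, assuming `genes` is the raw measurement data and `health` is the label vector defined earlier (first 20 rows healthy):

```r
library(ggplot2)

# Observations in the space of the two highest-contribution genes
ggplot(as.data.frame(genes), aes(x = V502, y = V589, color = health)) +
  geom_point() +
  theme_minimal()
```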
PCA thus offers a way to summarize a vast number of variables in a handful of dimensions. This can serve as a tool to pick interesting variables where, for example, visual inspection would be hopeless.